What We'll Cover
Last session, we dissected the transformer architecture—the structure of modern LLMs. But how do billions of parameters actually learn to generate coherent, knowledgeable text? This session covers the training process: how we go from random weights and massive text corpora to models that can write, reason, and assist with research.
We'll explore the pre-training objective, the scale of data required, the optimization techniques that make training feasible, and the staggering computational costs involved. We'll also examine scaling laws that predict model performance and the parallelism strategies that enable training at unprecedented scales.
Key question: Why does training a frontier model cost tens of millions of dollars—and what are researchers buying with that money?
🎯 Pre-Training Objectives
Before we can train a model, we need a learning objective: what task should the model solve during training? Modern LLMs primarily use one deceptively simple objective.
🔑 Next-Token Prediction
The dominant pre-training objective for modern decoder-only LLMs: given a sequence of tokens, predict the next token.
In many training pipelines this is the main loss, and it’s powerful: applied to massive corpora, it encourages the model to capture grammar, factual associations, and patterns of reasoning. Some training setups also add auxiliary losses or curriculum/mixing strategies (especially for multimodal or retrieval-augmented systems).
Autoregressive Language Modeling
The dominant approach for decoder-only models (GPT, Claude, LLaMA):
- Input: Sequence of tokens [t₁, t₂, ..., tₙ]
- Task: Predict tₙ₊₁ given [t₁, ..., tₙ]
- Training signal: Cross-entropy loss between predicted distribution and actual next token
- Causal masking: Model can only see previous tokens, not future ones
- Example: "The capital of France is" → model should assign high probability to "Paris"
Why This Works
Next-token prediction seems simple, but it pushes the model to represent:
- Syntax: Grammatical structure, word order, punctuation
- Semantics: Context-dependent meaning and reference
- Regularities: Statistical structure of language, code, and math-like text
- Factual patterns: Associations present in the corpus (not guaranteed truth)
- Inference patterns: Common reasoning templates that help predict what comes next
💡 Self-Supervised Learning
No human labels needed—the text itself provides supervision. This is why LLMs can train on very large corpora without manual annotation.
Alternative Objectives (Historical and Still Useful)
Other pre-training approaches:
- Masked Language Modeling (BERT): Mask random tokens, predict them from context (bidirectional)
- Encoder-Decoder (T5): Span corruption/reconstruction tasks
- Prefix LM: Hybrid approach with bidirectional prefix and autoregressive suffix
Why decoder-only is common: Simple, scales well, and is directly suited to open-ended generation. That said, encoder–decoder and masked objectives remain competitive for some tasks (e.g., translation, certain structured generation settings).
📹 A little more on next-token prediction
📚 Training Data: Scale, Quality, and Curation
The data you train on determines what your model can pick up. Modern LLMs are trained on enormous token counts from diverse sources—but data quality and composition matter as much as quantity.
📊 Data Scale Evolution
Training data has grown dramatically:
| Model | Year | Training Tokens | Data Sources |
|---|---|---|---|
| GPT-2 | 2019 | ~8 billion | WebText (outbound Reddit links) |
| GPT-3 | 2020 | ~300 billion | Filtered Common Crawl, WebText2, Books, Wikipedia (mixture described in the paper) |
| GPT-4 | 2023 | ~13 trillion (unreported, speculative) | Undisclosed; multimodal data included in GPT-4V-style systems |
| LLaMA 2 | 2023 | 2 trillion | Publicly available data; no Meta user data (per authors) |
| LLaMA 3 | 2024 | ~15 trillion | More diverse sources; stronger filtering and quality controls (per authors) |
Trend (often claimed): As public web text becomes noisier and more saturated, labs increasingly emphasize filtering, licensing, domain balancing, and synthetic data. The extent of “data scarcity” depends heavily on definitions and access to private/licensed corpora.
Common Data Sources
- Web crawls: Common Crawl, C4 (Colossal Clean Crawled Corpus)
- Books: Project Gutenberg, other book corpora (often legally and ethically contentious)
- Wikipedia: High-quality encyclopedic knowledge
- Code repositories: GitHub, Stack Overflow (for coding + structured text patterns)
- Scientific papers: arXiv, PubMed, and academic corpora
- Forums/conversations: Reddit and other forums (with heavy filtering)
Data Curation Challenges
- Quality filtering: Removing spam, boilerplate, and low-information text
- Deduplication: Reducing repeats that inflate token counts without adding signal
- Toxicity: Filtering harmful or unsafe content (imperfectly)
- Privacy: Removing personal/sensitive information
- Copyright: Legal questions about using copyrighted material
- Contamination: Preventing evaluation/test sets from leaking into training data
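To make the deduplication step above concrete, here is a minimal exact-duplicate filter using content hashes. This is a toy sketch: real pipelines also use near-duplicate detection (e.g., MinHash), and the normalization here is deliberately crude:

```python
import hashlib

def normalize(text):
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Drop exact duplicates after normalization; keep first occurrence."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello world.", "hello   WORLD.", "A different document."]
print(len(deduplicate(docs)))  # the first two normalize identically
```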
Data Composition Matters
It's not just volume—the mix of data types affects model behavior:
- Code data: Often correlates with better performance on structured reasoning tasks (causal mechanism debated)
- Mathematical text: Can improve symbolic/quantitative patterns (often still brittle)
- Multilingual data: Enables cross-lingual transfer and broader coverage
- Instruction-like data: Some pipelines include instruction-following examples earlier than “alignment” (varies by lab)
- Balance: Over-representing one domain biases style and knowledge
🔧 Tokenization: Preparing Text for Training
Before training, text must be converted to tokens:
- BPE (Byte-Pair Encoding): Common for GPT/LLaMA-family tokenizers. Iteratively merges frequent symbol pairs.
- WordPiece: Used by BERT-style models; similar spirit to BPE with a different objective.
- SentencePiece: A tokenizer toolkit that can implement BPE or Unigram LM; trains from raw text without pre-tokenization, often with byte fallback options.
- Vocabulary size: Often 32K–100K tokens. Larger vocab can reduce sequence length but increases embedding/softmax sizes.
Trade-off: Finer tokens = more flexible, longer sequences. Coarser tokens = shorter sequences, less flexibility for rare words/morphology.
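The BPE merge loop described above fits in a few lines. This is an illustrative sketch of the core idea only (real tokenizers like those for GPT/LLaMA add byte-level handling, pre-tokenization rules, and learned merge tables):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs across the corpus; return the most common."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; each merge step grows the vocabulary by one symbol
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After three merges the frequent substring "low" has become a single token, which is exactly the compression behavior the vocabulary-size trade-off below is about.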
📄 Reading Resources: Data Curation, Scaling Laws, and Data Mixtures
🧠 Core Papers
- Chinchilla scaling laws / compute-optimal training
Training Compute-Optimal Large Language Models (Hoffmann et al., 2022).
Key idea: for a fixed compute budget, many models are under-trained on too few tokens; compute-optimal regimes often favor training on more tokens per parameter.
arXiv:2203.15556
- Earlier scaling laws
Scaling Laws for Neural Language Models (Kaplan et al., 2020).
Establishes empirical scaling relationships and provides a baseline lens for thinking about compute/data/model size regimes (pre-Chinchilla).
arXiv:2001.08361
- GPT-3 data mixture
Language Models are Few-Shot Learners (Brown et al., 2020).
Describes the training data mixture (filtered Common Crawl + WebText2 + Books + Wikipedia) and why mixture design matters.
arXiv:2005.14165
- LLaMA 2 data + filtering
Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023).
Data sources, filtering, and high-level curation choices from a major open model.
arXiv:2307.09288
- GPT-4 Technical Report
Not detailed on datasets, but useful for seeing what major labs disclose and how they frame data/safety constraints at a high level.
arXiv:2303.08774
🎓 Practical Guides / Hands-On References
- Hugging Face Course (data prep + tokenization basics)
Tokenization, dataset preparation, and evaluation hygiene.
huggingface.co/learn
- Instruction-tuning dataset construction (for curation intuition)
Alpaca-style pipelines illustrate balancing, formatting, and contamination pitfalls (more finetuning than pretraining).
Stanford Alpaca (GitHub)
📹 LLM Tokenizers explained
⚙️ Optimization: Making Training Work
Training a billion-parameter model requires sophisticated optimization techniques. You can't just run vanilla gradient descent and expect it to work!
🎯 The Optimization Challenge
Training LLMs means finding good values for billions of parameters in a high-dimensional space. The loss landscape is non-convex, and training is computationally expensive and sensitive to hyperparameters.
Modern LLM training relies on: adaptive optimizers (Adam variants), careful learning rate schedules, large global batch sizes, and gradient clipping to reduce instability.
Optimizers
Algorithms for updating parameters based on gradients:
- SGD: Foundational but often slower/harder to tune at LLM scale.
- Adam: Adaptive learning rates per parameter; common baseline.
- AdamW: Adam with decoupled weight decay; a standard choice for LLM training.
- Adafactor: Memory-efficient variant used in some very large models.
- Lion: A recent optimizer sometimes explored for efficiency; usage varies and is not universal.
💡 Why Adam-style optimizers?
They maintain momentum and adaptive step sizes, which often stabilizes training across very heterogeneous parameter scales.
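The AdamW update can be written out for a single scalar parameter. This is a didactic sketch with illustrative hyperparameters, not a production optimizer (real implementations vectorize over all parameters):

```python
def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly, not the gradient
    param = param - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * param)
    return param, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=t)
print(p)
```

Note how the effective step size is roughly `lr` regardless of the raw gradient magnitude, because the gradient is divided by its own running scale.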
Learning Rate Schedules
Learning rate determines step size during optimization. Too high = instability; too low = slow convergence.
- Warmup: Start small and ramp up (reduces early instability)
- Cosine decay: Common after warmup
- Linear decay: Another standard schedule
- Typical peak LR: often around 1e-4 to 3e-4 in many public recipes (depends strongly on batch size/model size)
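The warmup-then-cosine shape described above is easy to write down. All the constants here are illustrative defaults (every real recipe picks its own warmup length, peak, and floor):

```python
import math

def lr_schedule(step, warmup_steps=2000, total_steps=100_000,
                peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(1000), lr_schedule(2000), lr_schedule(100_000))
```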
Batch Size & Gradient Accumulation
- Batch size: Number of sequences/tokens processed before updating parameters
- Large global batches: In many large-scale runs the global batch corresponds to millions of tokens, but it varies by model and hardware
- Why large? Stabilizes training and improves hardware utilization
- Gradient accumulation: If you can't fit the batch, accumulate gradients over multiple forward passes
- Trade-off: Extremely large batches can alter generalization dynamics; effects depend on regime
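Gradient accumulation is just averaging gradients over several small forward/backward passes before a single optimizer step. This toy sketch uses scalar "gradients" to show that the result matches the full-batch gradient (the gradient function here is hypothetical):

```python
def train_step_with_accumulation(microbatches, accumulation_steps, compute_grad):
    """Average gradients over several microbatches, then take one optimizer step.

    Simulates a larger batch than fits in memory at once.
    """
    accumulated = 0.0
    for micro in microbatches[:accumulation_steps]:
        accumulated += compute_grad(micro) / accumulation_steps
    return accumulated  # hand this single averaged gradient to the optimizer

# Toy gradient function: the gradient equals the batch mean
grads = train_step_with_accumulation(
    microbatches=[[1.0, 2.0], [3.0, 4.0]],
    accumulation_steps=2,
    compute_grad=lambda batch: sum(batch) / len(batch),
)
print(grads)  # equals the mean over the full batch [1, 2, 3, 4]
```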
🛡️ Preventing Training Instability
Training can diverge due to exploding gradients or numerical issues. Common mitigations:
- Gradient clipping: Cap gradient norm (often around 1.0, but not universal)
- Mixed precision: Use lower precision for speed, with higher precision accumulation to reduce under/overflow
- Normalization: LayerNorm/RMSNorm stabilizes activations
- Warmup: Gradual LR ramp prevents early divergence
- Checkpointing: Reload from recent stable checkpoint after spikes/failures
Reality check: Large-scale training can still fail; robust checkpointing and monitoring are essential.
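Gradient clipping, the first mitigation above, works by rescaling all gradients when their combined norm is too large. A minimal sketch on a flat list of scalars (real frameworks operate on parameter tensors):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 -> rescaled to 1.0
print(clipped)
```

Because the whole gradient vector is scaled by one factor, the update direction is preserved; only the step size shrinks.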
📹 ChatGPT 5.2 explanation of the Adam Optimizer and Learning Rate Scheduling
Here you will find the explanation of the material above that I generated with ChatGPT 5.2. I use a very specific preprompt; try the same prompt in different AIs and see how well, or badly, each one does.
💰 Computational Resources & Costs
Training frontier LLMs requires staggering amounts of compute. Let's quantify exactly what that means.
🔢 Understanding FLOPs
FLOP = Floating Point Operation (a single addition or multiplication)
A common dense-transformer back-of-the-envelope estimate:
- Training compute: ≈ 6 FLOPs per parameter per token
- Forward pass: ≈ 2 FLOPs/param/token
- Backward pass: ≈ 4 FLOPs/param/token
Caveat: Real compute differs with architecture and implementation (attention vs MLP ratios, sequence length effects, activation checkpointing, MoE routing, etc.). Use this as an order-of-magnitude estimate.
Example: Training a 7B parameter model on 1T tokens ≈ 6 × 7B × 1T = 42 × 10²¹ FLOPs = 42 zettaFLOPs
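The example above as a two-line calculator (remember this is an order-of-magnitude estimate, per the caveat):

```python
def training_flops(params, tokens, flops_per_param_token=6):
    """Back-of-the-envelope training compute for a dense transformer."""
    return flops_per_param_token * params * tokens

flops = training_flops(params=7e9, tokens=1e12)
print(f"{flops:.1e} FLOPs")  # ~4.2e+22, i.e. 42 zettaFLOPs
```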
💸 Cost Estimates for Famous Models
Compute requirements and estimated costs (highly approximate):
| Model | Parameters | Training Tokens | Compute (FLOPs) | GPU-Days (A100) | Estimated Cost |
|---|---|---|---|---|---|
| GPT-3 175B | 175B | 300B | ~3.1 × 10²³ | ~11,500 | ~$5–10M |
| LLaMA 2 70B | 70B | 2T | ~8.4 × 10²³ | ~31,000 | ~$15–20M |
| GPT-4 (speculative dense equivalent) | ~1.76T (unconfirmed) | ~13T (unconfirmed) | ~1.4 × 10²⁶ | ~5,000,000 | ~$100M+ |
| Claude Opus 4.5 (speculative dense equivalent) | ~500B (unconfirmed) | ~10T (unconfirmed) | ~3 × 10²⁵ | ~1,100,000 | ~$50–70M |
Note: These are rough estimates. Modern frontier models may use Mixture-of-Experts (MoE), where “parameters” and “active compute per token” diverge. Costs also depend on utilization, restarts, networking overheads, and whether compute is rented or owned.
Hardware Requirements
- GPUs: Thousands of A100/H100/H200 GPUs running in parallel
- Interconnect: High-bandwidth NVLink / InfiniBand for GPU communication
- Storage: Large, fast storage for streaming training data
- Clusters: Datacenters with substantial power and cooling constraints
- Cost per GPU-hour (cloud): Varies widely by provider/region/commitment; treat any single number as a moving target
Training Time
How long does it take?
- Small models (7B): Days to weeks on moderate clusters
- Mid-size (70B): Weeks to months on large clusters
- Frontier (1T+): Months on massive clusters (10K+ GPUs)
- Bottleneck: Often GPU communication, not raw compute
- Restarts: Training runs fail; checkpointing every few hours is critical
⚡ Energy Consumption
Energy use depends on hardware, datacenter efficiency (PUE), utilization, and retries. Published numbers vary substantially across sources and assumptions.
Rule of thumb for impact discussions: inference can dominate total energy/carbon footprint for widely deployed models, but this depends on deployment scale and usage patterns.
📈 Scaling Laws: Predicting Performance
Can we predict how good a model will be before spending millions on training? Scaling laws say yes—with important caveats.
🔮 The Scaling Law Hypothesis
Model performance (often measured by loss on held-out data) follows predictable power laws as a function of:
- Model size (N): Number of parameters
- Dataset size (D): Number of training tokens
- Compute (C): FLOPs spent on training
These relationships let labs extrapolate performance and plan expensive training runs.
Kaplan Scaling Laws (2020)
OpenAI's early scaling-law analysis suggested:
- Smooth power laws: Test loss decreases predictably with scale
- Weak sensitivity to shape: Depth vs width mattered less than total scale in their regime
- Overfitting was limited: With large datasets, larger models continued improving
- Takeaway (in that regime): Scaling model size was an effective lever
Chinchilla Scaling Laws (2022)
DeepMind revised the story: data matters more than previously assumed.
- Compute-optimal training: Model size and tokens should scale together
- Rule of thumb: ~20 tokens per parameter for compute-optimal dense training (in their setting)
- GPT-3 undertrained (by that criterion): 175B params on 300B tokens; compute-optimal would use more tokens
- Chinchilla: 70B params on ~1.4T tokens outperformed some larger-but-undertrained models
💡 The Chinchilla Insight
If you have a fixed compute budget, it can be better to train a smaller model on more data than a huge model on too little data.
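The Chinchilla rule of thumb can be turned into a budget split. Combining the compute estimate C ≈ 6·N·D with D ≈ 20·N gives C ≈ 120·N², so N ≈ √(C/120). This sketch applies that (the rule of thumb comes from one training setup; it is a planning heuristic, not a law):

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20):
    """Split a compute budget between parameters and tokens.

    Uses C = 6 * N * D together with D = tokens_per_param * N.
    """
    params = math.sqrt(compute_flops / (6 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# A budget in the rough vicinity of Chinchilla's own training run
n, d = chinchilla_allocation(5.8e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

For this budget the function returns roughly 70B parameters and 1.4T tokens, consistent with the Chinchilla figures quoted above.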
Implications for Research
- Predictability: Early loss curves can often forecast final loss
- Allocation choices: Labs trade off model size vs tokens based on product constraints
- Algorithmic progress: Architecture/optimizer improvements matter, but scale remains a dominant driver
- Sharp transitions: Some benchmarks show abrupt-looking capability jumps; whether these are intrinsic or evaluation-dependent is debated
📊 Training for Compute vs Training for Inference Cost
“Compute-optimal” usually means best loss for a fixed training compute budget. Product teams may instead optimize for low inference cost at a target quality, which can justify extra training (more tokens) to make a smaller model good enough.
| Consideration | Compute-Optimal (fixed training compute) | Inference-Targeted (fixed target quality) |
|---|---|---|
| Model size | Often smaller (paired with more tokens) | Often smaller (to reduce inference cost), but trained longer |
| Training tokens | Scaled with parameters (e.g., ~20 tokens/param in Chinchilla regime) | May exceed compute-optimal tokens to reach a target quality with a smaller model |
| Training cost | Optimized for given budget | Potentially higher (extra training to reduce deployment costs) |
| Inference cost | Not the primary objective; depends on the resulting model size | Explicitly minimized (smaller model for a target quality) |
| Best for | Research planning; compute budgeting; fast iteration | Deployed systems where serving cost dominates |
Example framing: A lab might spend extra training compute to make a smaller model reach the desired quality, because the smaller model is dramatically cheaper to serve at scale.
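A toy break-even calculation makes this framing concrete. Every number here is hypothetical, chosen only to show the shape of the trade-off:

```python
def break_even_queries(extra_training_cost, cost_per_query_large,
                       cost_per_query_small):
    """Queries served before extra training spend pays for itself."""
    savings_per_query = cost_per_query_large - cost_per_query_small
    return extra_training_cost / savings_per_query

# Hypothetical: $2M of extra training lets a smaller model match quality,
# saving $0.001 on every served query
q = break_even_queries(2_000_000, 0.003, 0.002)
print(f"{q:.0f} queries")  # roughly 2 billion queries to break even
```

At high deployment volumes, break-even arrives quickly, which is why serving-cost-driven labs over-train small models.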
🔀 Parallelism: Training at Scale
A single GPU can't hold a 70B parameter model, let alone train it. Modern LLM training requires distributing the model and data across many GPUs using parallelism strategies.
Data Parallelism
The simplest approach: replicate the model across GPUs and split data across them.
- Setup: Each GPU holds a complete model replica
- Process: Different GPUs process different batches
- Synchronization: Aggregate gradients across GPUs
- Pros: Simple, good utilization
- Cons: Memory-limited—very large models won’t fit on one GPU
- Used for: Smaller models or when combined with sharding
Model Parallelism (Tensor Parallel)
Split individual layers across GPUs.
- Setup: Each transformer layer split across multiple GPUs
- Example: Attention/MLP matrices split across GPUs
- Process: Frequent intra-layer communication
- Pros: Can scale to very large models
- Cons: Communication overhead
- Used for: Models too large for a single GPU
Pipeline Parallelism
Split model into stages—each GPU handles specific layers.
- Setup: Layers partitioned into pipeline stages
- Microbatching: Keeps all stages busy
- Pros: Less intra-layer comms than tensor parallel
- Cons: Pipeline “bubbles” (idle time), scheduling complexity
- Used for: Very large models (often with tensor parallel)
🎯 Combining Parallelism: 3D Parallelism
Frontier training often combines multiple strategies:
Example: Training a 175B model on 1024 GPUs
- Data parallelism: 16-way
- Pipeline parallelism: 8-way
- Tensor parallelism: 8-way
- Total: 16 × 8 × 8 = 1024 GPUs
Challenge: Tuning these dimensions is nontrivial: too much tensor parallelism can bottleneck on communication; too much pipeline parallelism increases bubble overhead.
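The example above can be sketched as a mapping from a flat GPU rank to its position in the 3D grid. The specific axis ordering below is one common layout among several used in practice:

```python
def rank_to_coords(rank, dp=16, pp=8, tp=8):
    """Map a flat GPU rank to (data, pipeline, tensor) parallel coordinates."""
    tp_rank = rank % tp                 # tensor-parallel peers are adjacent ranks
    pp_rank = (rank // tp) % pp         # then pipeline stages
    dp_rank = rank // (tp * pp)         # then data-parallel replicas
    return dp_rank, pp_rank, tp_rank

total = 16 * 8 * 8                      # 1024 GPUs, as in the example above
coords = {rank_to_coords(r) for r in range(total)}
print(len(coords))  # every GPU gets a unique (dp, pp, tp) coordinate
```

Placing tensor-parallel peers on adjacent ranks is deliberate: they communicate most frequently, so they should sit on the fastest interconnect (e.g., NVLink within a node).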
🛠️ Frameworks & Tools
Libraries that implement these parallelism strategies:
- DeepSpeed (Microsoft): ZeRO optimizations and multi-parallel strategies
- Megatron-LM (NVIDIA): Efficient tensor and pipeline parallelism for transformers
- FSDP (PyTorch): Fully Sharded Data Parallel (sharding + data parallel concepts)
- JAX/Flax: Flexible parallelism used in some large-scale research systems
📹 GPU explainer
📹 JAX explainer
📚 Summary & Key Takeaways
You now understand how LLMs are trained at scale:
- Pre-training objective: Next-token prediction is the dominant driver for decoder-only LMs (sometimes with additional training tricks)
- Training data: Massive corpora from curated web, books, code, papers—quality and composition matter
- Optimization: AdamW + LR schedules + large global batches + stability techniques
- Compute costs: Large-scale training requires enormous compute; exact costs depend on hardware and efficiency
- Scaling laws: Loss often scales predictably with compute; Chinchilla emphasizes balancing model size and tokens
- Parallelism: Data + tensor + pipeline parallelism distributes training across many GPUs
Next session (Week 2.3): We'll explore what happens after pre-training—fine-tuning, RLHF, and alignment techniques that turn raw language models into helpful AI assistants.